(C) 2017-2020 by Damir Cavar
Download: This and various other Jupyter notebooks are available from my GitHub repo.
This is a tutorial related to the discussion of grammar engineering and parsing in the classes Alternative Syntactic Theories and Advanced Natural Language Processing, taught at Indiana University in Spring 2017 and Fall 2018.
The following examples are taken from the NLTK parsing HOWTO page.
In [1]:
from nltk import Nonterminal, nonterminals, Production, CFG
In [2]:
nt1 = Nonterminal('NP')
nt2 = Nonterminal('VP')
In [3]:
nt1.symbol()
Out[3]:
'NP'
In [4]:
nt1 == Nonterminal('NP')
Out[4]:
True
In [5]:
nt1 == nt2
Out[5]:
False
In [6]:
S, NP, VP, PP = nonterminals('S, NP, VP, PP')
print(S.symbol())
In [7]:
N, V, P, DT = nonterminals('N, V, P, DT')
In [8]:
prod1 = Production(S, [NP, VP])
In [9]:
prod2 = Production(NP, [DT, NP])
In [10]:
prod1.lhs()
Out[10]:
S
In [11]:
prod1.rhs()
Out[11]:
(NP, VP)
In [12]:
prod1 == Production(S, [NP, VP])
Out[12]:
True
In [13]:
prod1 == prod2
Out[13]:
False
In [14]:
grammar = CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> 'the' N | N PP | 'the' N PP
VP -> V NP | V PP | V NP PP
N -> 'cat'
N -> 'fish'
N -> 'alligator'
N -> 'dog'
N -> 'rug'
N -> 'mouse'
V -> 'chased'
V -> 'sat'
P -> 'in'
P -> 'on'
""")
In [15]:
print(grammar)
One can build complex feature structures using the following strategies:
In [16]:
import nltk
fstr = nltk.FeatStruct("[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]")
print(fstr)
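Besides the bracketed-string notation, a feature structure can also be built from keyword arguments, with nested FeatStruct objects for embedded structures. A small sketch; both forms should yield an equivalent structure:

```python
import nltk

# The same structure as above, built from keyword arguments
# instead of a parsed string
fstr_kw = nltk.FeatStruct(POS='N',
                          AGR=nltk.FeatStruct(PER=3, NUM='pl', GND='fem'))
print(fstr_kw)
```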
Creating shared paths is also possible:
In [17]:
fstr2 = nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")
print(fstr2)
Let us create feature structures and try out unification:
In [21]:
fs1 = nltk.FeatStruct("[AGR=[PER=3, NUM='pl', GND='fem'], POS='N']")
fs2 = nltk.FeatStruct("[POS='N', AGR=[PER=3, GND='fem']]")
print(fs1.unify(fs2))
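Unification fails when feature values conflict; in that case unify returns None instead of a feature structure. A minimal sketch:

```python
import nltk

# Conflicting values for NUM make unification fail,
# so unify() returns None
fs_sg = nltk.FeatStruct("[AGR=[NUM='sg']]")
fs_pl = nltk.FeatStruct("[AGR=[NUM='pl']]")
print(fs_sg.unify(fs_pl))
```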
The following examples are taken from the NLTK parsing HOWTO page.
In [22]:
import nltk
In [23]:
nltk.parse.chart.demo(2, print_times=False, trace=1,
                      sent='I saw a dog', numparses=1)
This is an example of how to apply top-down parsing:
In [24]:
nltk.parse.chart.demo(1, print_times=True, trace=0,
                      sent='she killed the man with the tie', numparses=2)
This is how to apply bottom-up parsing:
In [25]:
nltk.parse.chart.demo(2, print_times=False, trace=0,
                      sent='I saw John with a dog', numparses=2)
In [ ]:
nltk.parse.featurechart.demo(print_times=False,
                             print_grammar=True,
                             parser=nltk.parse.featurechart.FeatureChartParser,
                             sent='I saw John with a dog')
We will need the following NLTK modules in this section:
In [26]:
import nltk
from nltk import CFG
from nltk.grammar import FeatureGrammar as FCFG
We can load a grammar from a file located in the same folder as the current Jupyter notebook in the following way:
In [27]:
cfg = nltk.data.load('spanish1.cfg')
print(cfg)
We instantiate a ChartParser object with this grammar:
In [28]:
cp1 = nltk.parse.ChartParser(cfg)
The ChartParser object has a parse function that takes a list of tokens as its parameter. The token list can be generated with a language-specific tokenizer; in this case we simply tokenize using Python's split method on strings. The parse function returns an iterator over parse trees. We loop through the parse trees and print them out:
In [29]:
"los mujeres adoran la Lucas".split()
Out[29]:
['los', 'mujeres', 'adoran', 'la', 'Lucas']
In [30]:
for x in cp1.parse("los mujeres adoran la Lucas".split()):
    print(x)
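Note that split is only adequate for cleanly whitespace-separated input; punctuation stays attached to the neighboring word, which would then not match any lexical rule. A simple regular-expression tokenizer, sketched here with only the standard library, separates punctuation into its own tokens:

```python
import re

sent = "the woman killed the man, with the tie."
print(sent.split())                      # punctuation sticks to the tokens
print(re.findall(r"\w+|[^\w\s]", sent))  # punctuation becomes its own token
```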
We can also edit a grammar directly:
In [32]:
cfg2 = CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> 'the' N | N PP | 'the' N PP
VP -> V NP | V PP | V NP PP
N -> 'cat'
N -> 'dog'
N -> 'bird'
N -> 'rug'
N -> 'woman'
N -> 'man'
N -> 'tie'
V -> 'chased'
V -> 'killed'
V -> 'sat'
V -> 'bit'
P -> 'in'
P -> 'on'
P -> 'with'
""")
We parse our example sentences using the same approach as above:
In [33]:
cp2 = nltk.parse.ChartParser(cfg2)
for x in cp2.parse("the woman killed the man with the tie".split()):
    print(x)
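Before parsing, it can help to verify that every token in a sentence is covered by the grammar's lexicon. CFG objects provide a check_coverage method that raises a ValueError listing the missing tokens. A small sketch with a toy grammar:

```python
import nltk
from nltk import CFG

# Toy grammar used only to demonstrate coverage checking
toy = CFG.fromstring("""
S -> N V
N -> 'dogs'
V -> 'bark'
""")
toy.check_coverage("dogs bark".split())   # all tokens known: no error
try:
    toy.check_coverage("cats bark".split())
except ValueError as err:
    print(err)                            # reports the missing token(s)
```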
The previous example used a context-free grammar. In the following example we load a context-free grammar with features, instantiate a FeatureChartParser, and loop through the parse trees generated by our grammar to print them out:
In [34]:
fcfg = nltk.data.load('spanish1.fcfg')
fcp1 = nltk.parse.FeatureChartParser(fcfg)
for x in fcp1.parse(u"Miguel adoró el gato".split()):
    print(x)
We can also define a feature CFG directly in this notebook in the same way and then parse with it:
In [ ]:
fcfg2 = FCFG.fromstring("""
% start CP
# ############################
# Grammar Rules
# ############################
CP -> Cbar[stype=decl]
Cbar[stype=decl] -> IP[+TNS]
IP[+TNS] -> DP[num=?n,pers=?p,case=nom] VP[num=?n,pers=?p]
DP[num=?n,pers=?p,case=?k] -> Dbar[num=?n,pers=?p,case=?k]
Dbar[num=?n,pers=?p] -> D[num=?n,DEF=?d,COUNT=?c] NP[num=?n,pers=?p,DEF=?d,COUNT=?c]
Dbar[num=?n,pers=?p] -> NP[num=?n,pers=?p,DEF=?d,COUNT=?c]
Dbar[num=?n,pers=?p,case=?k] -> D[num=?n,pers=?p,+DEF,type=pron,case=?k]
NP[num=?n,pers=?p,COUNT=?c] -> N[num=?n,pers=?p,type=prop,COUNT=?c]
VP[num=?n,pers=?p] -> V[num=?n,pers=?p,val=1]
VP[num=?n,pers=?p] -> V[num=?n,pers=?p,val=2] DP[case=acc]
PP -> P DP[num=?n,pers=?p,case=acc]
#PP -> P DP[num=?n,pers=?p,case=dat]
#
# ############################
# Lexical Rules
# ############################
D[-DEF,+COUNT,num=sg] -> 'a'
D[-DEF,+COUNT,num=sg] -> 'an'
D[+DEF] -> 'the'
D[+DEF,gen=f,num=sg,case=nom,type=pron] -> 'she'
D[+DEF,gen=m,num=sg,case=nom,type=pron] -> 'he'
D[+DEF,gen=n,num=sg,type=pron] -> 'it'
D[+DEF,gen=f,num=sg,case=acc,type=pron] -> 'her'
D[+DEF,gen=m,num=sg,case=acc,type=pron] -> 'him'
N[num=sg,pers=3,type=prop] -> 'John' | 'Sara' | 'Mary'
V[tns=pres,num=sg,pers=3,val=2] -> 'loves' | 'calls' | 'sees' | 'buys'
N[num=sg,pers=3,-COUNT] -> 'furniture' | 'air' | 'justice'
N[num=sg,pers=3] -> 'cat' | 'dog' | 'mouse'
N[num=pl,pers=3] -> 'cats' | 'dogs' | 'mice'
V[tns=pres,num=sg,pers=3,val=1] -> 'sleeps' | 'snores'
V[tns=pres,num=sg,pers=1,val=1] -> 'sleep' | 'snore'
V[tns=pres,num=sg,pers=2,val=1] -> 'sleep' | 'snore'
V[tns=pres,num=pl,val=1] -> 'sleep' | 'snore'
V[tns=past,val=1] -> 'slept' | 'snored'
V[tns=pres,num=sg,pers=3,val=2] -> 'calls' | 'sees' | 'loves'
V[tns=pres,num=sg,pers=1,val=2] -> 'call' | 'see' | 'love'
V[tns=pres,num=sg,pers=2,val=2] -> 'call' | 'see' | 'love'
V[tns=pres,num=pl,val=2] -> 'call' | 'see' | 'love'
V[tns=past,val=2] -> 'called' | 'saw' | 'loved'
""")
We can now create a parser instance and parse with this grammar:
In [ ]:
fcp2 = nltk.parse.FeatureChartParser(fcfg2, trace=1)
sentence = "John buys the furniture"
result = list(fcp2.parse(sentence.split()))
if result:
    for x in result:
        print(x)
else:
    print("*", sentence)
Countable nouns and articles in a DP:
In [ ]:
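As a sketch of how the COUNT feature restricts determiner-noun combinations, here is a minimal toy fragment (a hypothetical example, not the full grammar above) in which the indefinite article carries +COUNT and must agree with the noun's COUNT value:

```python
import nltk
from nltk.grammar import FeatureGrammar

# Toy fragment: 'a' is +COUNT, so it cannot combine with a mass noun
toy_fcfg = FeatureGrammar.fromstring("""
% start DP
DP -> D[COUNT=?c] N[COUNT=?c]
D[+COUNT] -> 'a'
D -> 'the'
N[+COUNT] -> 'dog'
N[-COUNT] -> 'furniture'
""")
parser = nltk.parse.FeatureChartParser(toy_fcfg)
for phrase in ["a dog", "the furniture", "a furniture"]:
    trees = list(parser.parse(phrase.split()))
    print(phrase, "->", "OK" if trees else "*")
```

The underspecified entry for 'the' unifies with either COUNT value, while 'a furniture' fails because +COUNT and -COUNT cannot unify.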
DPs and pronouns:
In [ ]:
CP/IP sentence structures:
In [ ]:
This is a list of the different Feature Parsers in NLTK.
I do not know whether this is an exhaustive list.
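One quick way to get an overview is to list the names exported by nltk.parse that mention features. This is a probe of the installed NLTK version, not authoritative documentation, so it may miss parsers defined elsewhere:

```python
import nltk.parse

# Collect the names in nltk.parse that contain "Feature"
feature_parsers = sorted(name for name in dir(nltk.parse)
                         if 'Feature' in name)
for name in feature_parsers:
    print(name)
```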
(C) 2017-2018 by Damir Cavar - Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)